Abstract

Time-domain astronomy (TDA) is facing a paradigm shift caused by the exponential growth of the 
sample size, data complexity and data generation rates of new astronomical sky surveys. For example, 
the Large Synoptic Survey Telescope (LSST), which will begin operations in northern Chile in 2022, 
will generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky. The LSST will 
stream data at rates of 2 Terabytes per hour, effectively capturing an unprecedented movie of the sky. 
The LSST is expected not only to improve our understanding of time-varying astrophysical objects, but 
also to reveal a plethora of yet unknown faint and fast-varying phenomena. To cope with a change of 
paradigm to data-driven astronomy, the fields of astroinformatics and astrostatistics have been created 
recently. The new data-oriented paradigms for astronomy combine statistics, data mining, knowledge 
discovery, machine learning and computational intelligence, in order to provide the automated and 
robust methods needed for the rapid detection and classification of known astrophysical objects as well 
as the unsupervised characterization of novel phenomena. In this article we present an overview of 
machine learning and computational intelligence applications to TDA. Future big data challenges and 
new lines of research in TDA, focusing on the LSST, are identified and discussed from the viewpoint 
of computational intelligence/machine learning. Interdisciplinary collaboration will be required to cope 
with the challenges posed by the deluge of astronomical data coming from the LSST. 


I. Introduction 

Time domain astronomy (TDA) is the scientific field dedicated to the study of astronomical 
objects and associated phenomena that change through time, such as pulsating variable stars, 
cataclysmic and eruptive variables, asteroids, comets, quasi-stellar objects, eclipses, planetary 
transits and gravitational lensing, to name just a few. The analysis of variable astronomical 
objects paves the way towards the understanding of astrophysical phenomena, and provides 
valuable insights in topics such as galaxy and stellar evolution, universe topology, and others. 

Recent advances in observing, storage, and processing technologies have facilitated the evolution 
of astronomical surveys from observations of small and focused areas of the sky (MACHO 
[1], EROS [2], OGLE [3]) to deep and extended panoramic sky surveys (SDSS [4], Pan-STARRS 
[5], CRTS [6]). Data volume and generation rates are increasing exponentially, and instead of still 
images, future surveys will be able to capture digital “movies of the sky” from which variability 
will be characterized in ways never seen before. 

Several new grand telescopes are planned for the next decade [7], among which is the Large 
Synoptic Survey Telescope (LSST [8], [9]) under construction in northern Chile and expected to 
begin operations by 2022. The word “synoptic” is used here in the sense of covering large areas 
of the sky repeatedly, searching for variable objects in position and time. The LSST will generate 
a 150 Petabyte imaging database, and a 40 Petabyte worth catalog associated with 50 billion 
astronomical objects during 10 years ifT^ . The resolution, coverage, and cadence of the LSST 
will help us improve our understanding of known astrophysical objects and reveal a plethora of 
unknown faint and fast-varying phenomena. In addition, the LSST will issue approximately 2 
million alerts nightly related to transient events, such as supernovae, for which facilities around 
the world can follow up. 

To produce science from this deluge of data the following open problems need to be solved 
mil: a) real-time mining of data streams of ~ 2 Terabytes per hour, b) real-time classification 
of the 50 billion followed objects, and c) the analysis, evaluation, and knowledge extraction of 
the 2 million nightly events. 

The big data era is bringing a change of paradigm in astronomy, in which scientific advances 
are becoming more and more data-driven IIT^ . Astronomers, statisticians, computer scientists and 
engineers have begun collaborations towards the solution of the previously mentioned problems, 
giving birth to the scientific fields of astrostatistics and astroinformatics [fT^ . The development 
of fully-automated and robust methods for the rapid classification of what is known, and the 
characterization of emergent behavior in these massive astronomical databases are the main tasks 
of these new fields. We believe that computational intelligence, machine learning and statistics 
will play major roles in the development of these methods IIT^ . [fTOll . 

The remainder of this article is organized as follows: In section II the fundamental concepts 
related to time-domain astronomy are defined and described. In section III an overview of current 
computational intelligence (CI) and machine learning (ML) applications to TDA is presented. 
In section IV future big data challenges in TDA are presented and discussed, focusing on what 
is needed for the particular case of the LSST from an ML/CI perspective. Finally, in section V 
conclusions are drawn. 


II. Astronomical Background 

In this section we describe the basic concepts related to astronomical time series analysis and 
time-domain astronomical phenomena. Photometry is the branch of astronomy dedicated to the 
precise measurement of visible electromagnetic radiation from astronomical objects. To achieve 
this, several techniques and methods are applied to transform the raw data from the astronomical 
instruments into standard units of flux or intensity. The basic tool in the analysis of astronomical 
brightness variations is the light curve. A light curve is a plot of the magnitude of an object’s 
electromagnetic radiation (in the visible spectrum) as a function of time. 

Light curve analysis is challenging, not only because of the sheer size of the databases, but 
also due to the characteristics of the data itself. Astronomical time series are unevenly sampled 
due to constraints in the observation schedules, telescope allocations and other limitations. When 
observations are taken from Earth the resulting light curves will have periodic one-day gaps. The 
sampling is randomized because observations for each object happen at different times every 
night. The cycles of the moon, bad weather conditions and sky visibility impose additional 
constraints which translate into data gaps of different lengths. Space observations are also 
restricted as they are regulated by the satellite orbits. Discontinuities in light curves can also be 
caused by technical factors: repositioning of the telescopes, calibration of equipment, electrical, 
and mechanical failures, etc. 

Astronomical time series are also affected by several noise sources. These noise sources can 
be broadly categorized into two classes. The first class is related to observations, such as the 
brightness of closer astronomical objects, and atmospheric noise due to refraction and extinction 
phenomena (scattering of light due to atmospheric dust). On the other hand, there are noise 
sources related to the instrumentation, in particular to the CCD cameras, such as sensitivity 
variations of the detector, and thermal noise. In general, errors in astronomical time series are 
non-Gaussian and heteroscedastic, i.e., the variance of the error is not constant, and changes 
along the magnitude axis. 

Other common problematic situations arising in TDA are the sample-selection bias and the lack 
of balance between classes. Generally the astrophysical phenomena of interest represent a small 
fraction of the observable sky, hence the vast majority of the data belongs to the “background 
class”. This is especially noticeable when the objective is to find unknown phenomena, a task 
known as novelty detection. Sufficient coverage and exhaustive labeling are required in order to 
have a good representation of the sample, and to ensure that the rare objects of interest are captured. 

In the following we briefly describe several time-domain astronomical phenomena emphasizing 
their scientific interest. We focus on phenomena that vary in the optical spectrum. Among the 
“observable stars” there is a particular group called the variable stars [14], [15], IIT^ . Variable 
stars correspond to stellar objects whose brightness, as observed from Earth, fluctuates in time 
above a certain variability threshold defined by the sensitivity of the instruments. Variable star 
analysis is a fundamental pivot in the study of stellar structure and properties, stellar evolution 
and the distribution and size of our Universe. The major categories of variable stars are briefly 
described in the following paragraphs with emphasis on the scientific interest behind each of 
them. For a more in-depth definition of the objects and their mechanisms of variability, the 
reader can refer to m- The relation between different classes of variable stars is summarized 
by the tree diagram shown in Fig. 1 ifT^ . [fTTl . 

Fig. 1. Variable star topological classification. 

The analysis of intrinsic variable stars is of great importance for the study of stellar nuclei 
and evolution. Some classes of intrinsic variable stars can be used as distance markers to study 
the distribution and topology of the Universe. Cepheid and RR Lyrae stars [fTSll (Fig. 2a) are 
considered standard candles because of the relation between their pulsation period and their 
absolute brightness. It is possible to estimate the distance from these stars to Earth with the 
period and the apparent brightness measured from the telescope ifTSll . Type Ia Supernovae ifTSll 
are also standard candles, although they can be used to trace much longer distances than Cepheids 
and RR Lyrae [fT^ . The period of eclipsing binary stars [15] (Fig. 2b) is a key parameter in 
astrophysics studies as it can be used to calculate the radii and masses of the components. 
Light curves and phase diagrams of periodic variable stars are shown in Fig. 3. 

Fig. 2. (a) Light curve of a pulsating variable star (upper left panel), such as a Cepheid or RR Lyrae. The star pulsates 
periodically, changing in size, temperature and brightness, which is reflected in its light curve. (b) Light curve of an eclipsing binary 
star (upper right panel). The lower panels show the geometry of the binary system at the instants where the eclipses occur. The 
periodic pattern in the light curve is observed because the Earth (X axis) is aligned with the orbital plane of the system (Z axis). 
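
The standard-candle argument above can be made concrete with the distance modulus, m - M = 5 log10(d) - 5, which relates the apparent magnitude m, the absolute magnitude M and the distance d in parsecs. The following is a minimal sketch, not a calibrated procedure: it assumes that M has already been obtained from a period-luminosity relation (none is given in this article), it ignores interstellar extinction, and the function name is hypothetical.

```python
def distance_parsecs(apparent_mag: float, absolute_mag: float) -> float:
    """Distance in parsecs from the distance modulus m - M = 5*log10(d) - 5.

    Assumes absolute_mag is already known (e.g. from a period-luminosity
    calibration supplied elsewhere) and that extinction is negligible.
    """
    return 10.0 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)
```

For instance, a star with m = 10 and M = 0 has a distance modulus of 10 and therefore lies at 1,000 parsecs under these assumptions.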


III. Review of Computational Intelligence Applications in TDA 

Time-domain astronomers are faced with a wide array of scientific questions that are related 
to the detection, identification and modeling of variable phenomena such as those presented in 
the previous Section. We may classify these problems broadly as follows: 

i Extract information from the observed time series in order to understand the underlying 
processes of its source. 

ii Use previous knowledge of the time-varying universe to classify new variable sources 
automatically. How do we characterize what we know? 

iii Find structure in the data. Find what is odd and different from everything known. How do 
we compare astronomical objects? What similarity measure do we use? 

The computational intelligence and machine learning fields provide methods and techniques 
to deal with these problems in a robust and automated way. Problem i is a problem of modeling, 
parametrization and regression (kernel density estimation). Problem ii corresponds to supervised 
classification (artificial neural networks, random forests, support vector machines). Problem iii 
deals with unsupervised learning, feature space distances and clustering (k nearest neighbors, 
self-organizing maps). The correct utilization of these methods is key to dealing with the deluge of 
available astronomical data. In the following section we review particular cases of computational 
intelligence based applications for TDA. 

Fig. 3. Light curve and phase diagram of an RR Lyrae (a), Cepheid (b), Mira (c) and eclipsing binary star (d), respectively. 
The phase diagram is obtained using the underlying period of the light curves and the epoch folding transformation IB). If the 
folding period is correct a clear profile of the periodicity will appear in the phase diagram. 
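
The epoch folding transformation mentioned in the caption of Fig. 3 maps every observation time to a phase in [0, 1) for a trial period. A minimal sketch is given below; the function name and the `epoch` reference-time argument are illustrative assumptions rather than the notation of any cited work.

```python
def fold_light_curve(times, magnitudes, period, epoch=0.0):
    """Epoch folding: convert times to phases in [0, 1) and sort by phase.

    `epoch` is an assumed reference time that defines phase zero.
    Returns two lists (phases, magnitudes) ordered by increasing phase.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    folded = [(((t - epoch) / period) % 1.0, m)
              for t, m in zip(times, magnitudes)]
    folded.sort(key=lambda pair: pair[0])
    phases = [p for p, _ in folded]
    mags = [m for _, m in folded]
    return phases, mags
```

When the trial period matches the true period, observations taken many cycles apart land at the same phase and the folded points trace a single clean profile; with a wrong period they scatter across the diagram.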

A. Periodic Variable Star Discrimination 

We begin this review with a case of parameter estimation from light curves using information 
theoretic criteria. Precise period estimations are fundamental in the analysis of periodic variable 
stars and other periodic phenomena such as transiting exoplanets. In [21] the correntropy 
kernelized periodogram (CKP), a metric for period discrimination for unevenly sampled time 
series, was presented. This periodogram is based on the correntropy function [22], an information 
theoretic functional that measures similarity over time using statistical information contained in 
the probability density function (pdf) of the samples. In [21] the CKP was tested on a set of 5,000 
light curves from the MACHO survey [1] previously classified by experts. The CKP achieved a 
true positive rate of 97% with no false positives and outperformed conventional methods used 
in astronomy such as the Lomb-Scargle periodogram [|2^ , ANOVA and string length. 

In [24] the CKP was used as the core of a periodicity discrimination pipeline for light curves 
from the EROS-2 survey. The method was calibrated using a set of 100,000 synthetic light 
curves generated from multivariate models constructed following the EROS-2 data. Periodicity 
thresholds and rules to adapt the kernel parameters of the CKP were obtained in the calibration 
phase. Approximately 32 million light curves from the Large and Small Magellanic Clouds 
were tested for periodicity. The pipeline was implemented for GPGPU architectures, taking 18 
hours to process the whole EROS-2 set on a cluster with 72 GPUs. A catalog of 120 thousand 
periodic variable stars was obtained and cross-matched with existing catalogs for the Magellanic 
Clouds for validation. The main contributions of [24] are the procedure used to create the 
training database using the available survey data, the fast implementation geared towards large 
astronomical databases, the large periodic light curve catalog generated from EROS-2, and the 
valuable inference on the percentage of periodic variable stars. 

Another example of information theoretic concepts used for periodicity detection in light 
curves can be found in [25]. In this work the Shannon conditional entropy of a light curve is 
computed from a binned phase diagram obtained for a given period candidate. The conditional 
entropy is minimized in order to find the period that produces the most ordered phase diagram. 
The proposed method was tested using a training set of periodic light curves from the MACHO 
survey and the results show that it is robust against systematic errors produced by the sampling, 
data gaps, aliasing and artifacts in phase space. 
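
The conditional entropy criterion described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the cited implementation: the phase diagram is binned on a fixed grid (the defaults of 10 phase bins and 5 magnitude bins are arbitrary choices), magnitudes are rescaled to [0, 1], and the function names are hypothetical. The quantity computed is H(m|phase) = sum over occupied cells of p(m_i, phase_j) * ln(p(phase_j) / p(m_i, phase_j)).

```python
import math
from collections import Counter


def conditional_entropy(times, mags, period, n_phase=10, n_mag=5):
    """Conditional entropy H(m | phase) of a light curve folded at `period`."""
    lo, hi = min(mags), max(mags)
    span = (hi - lo) or 1.0  # avoid division by zero for constant curves
    joint = Counter()         # counts per (magnitude bin, phase bin) cell
    phase_counts = Counter()  # counts per phase bin
    for t, m in zip(times, mags):
        j = min(int(((t / period) % 1.0) * n_phase), n_phase - 1)
        i = min(int((m - lo) / span * n_mag), n_mag - 1)
        joint[(i, j)] += 1
        phase_counts[j] += 1
    n = len(times)
    # p(m_i, phase_j) = c / n and p(phase_j) = phase_counts[j] / n
    return sum((c / n) * math.log(phase_counts[j] / c)
               for (i, j), c in joint.items())


def best_period(times, mags, candidates, **kwargs):
    """Return the candidate period that minimizes the conditional entropy."""
    return min(candidates,
               key=lambda p: conditional_entropy(times, mags, p, **kwargs))
```

A well-ordered phase diagram concentrates the points of each phase bin in few magnitude bins, which gives a low conditional entropy; a scrambled diagram spreads them out and raises it. In practice the candidate grid and the bin counts would have to be tuned to the survey.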

In Tagliaferri et al. [26], neural networks are used to obtain the parameters of the periodogram 
of the light curve. These parameters are then fed into the MUSIC (Multiple Signal Classification) 
algorithm to generate a curve whose peaks are located at the periods sought. Interestingly, this work also 
shows the relation between the presented method and the Cramér-Rao bound, thus posing absolute 
practical limits to the performance of the proposed procedure. 

A comprehensive analysis of period finding algorithms for light curves can be found in [|T7l . 
In this work classical methods for period discrimination in astronomy such as the Lomb-Scargle 
periodogram and Phase Dispersion Minimization are compared to novel information theoretic 
criteria [25]. The authors note that the accuracy of each individual method is dependent on 
observational factors and suggest that an ensemble approach that combines several algorithms 
could mitigate this effect and provide a more consistent solution. How to combine the output of 
different methods and the increased computational complexity are key issues to be solved. 
B. Automated Supervised Classification for Variable Objects 

After obtaining the period, supervised methods can be used to discriminate among the known 
classes of periodic variable stars. In supervised learning, prior information in the form of a 
training dataset is needed to classify new samples. The creation and validation of these training 
sets are complex tasks in which human intervention is usually inevitable. This is particularly 
challenging in the astronomical case due to the vast amounts of available data. If the data do not 
initially represent the population well, then scientific discovery may be hindered. In addition, 
due to the differences between observational surveys, it is very difficult to reuse the training sets. 
In the following paragraphs several attempts of supervised classification for TDA are reviewed, 
with emphasis on the classification scheme and the design of the training databases. 

Gaussian mixture models (GMMs), and artificial neural networks trained through Bayesian 
model averaging (BAANN) were used to discriminate periodic variable stars from their light 
curves in [28]. Classification schemes using single multi-class and multi-stage (hierarchical) classifiers 
were compared. This work is relevant not only because of the application and comparison 
between the algorithms, but also because of the extended analysis performed in building the 
training dataset. First, well-known class prototypes were recovered from published catalogs and 
reviewed. Data from nine astronomical surveys were used, although the vast majority came from 
the Hipparcos and OGLE projects. A diagram of the classification pipeline for the hierarchical 
classifier is shown in Fig. 4a. The selected variability types were parametrized using harmonic 
fitting (Lomb-Scargle periodogram). The classes were organized in a tree-like structure similar to 
the one shown in Fig. 1. The final training set contained 1,732 samples from 25 well-represented 
classes. For the single stage classifier the GMM and BAANN obtained correct classification 
rates of 69% and 70%, respectively. Only the BAANN was tested using the multi-stage scheme, 
obtaining a correct classification rate of 71%. According to the authors, the GMM provides a 
simple solution with direct astrophysical interpretation. On the other hand, some machine learning 
algorithms may achieve lower misclassification rates but their interpretability is reduced. The 
authors also state the need for higher statistical knowledge, which can be provided through 
interdisciplinary cooperation, in order to use the machine learning approach. A more recent 
version of this method can be found in [29]. In this work 26,000 light curves from TrES and 
Kepler surveys were classified using a multi-stage tree of GMMs. The main difference from the 
previous version is the careful selection of significant frequencies and overtone features, which 
reduces confusion between classes. 

Fig. 4. (a) Classification scheme used in [28] for variable star classification. (b) Light curve processing pipeline used in [30]. 
A candidate period is used to obtain a phase diagram of the light curve which is then smoothed, interpolated to a fixed grid, 
and binned. The period, magnitude, color and binned values are used as features for the SVM classifier. 

In [30], 14,087 periodic light curves from the OGLE survey were used to train and test 
supervised classifiers based on k-NN and SVM. The periods and labels were obtained directly 
from the OGLE survey. The following periodic variable classes were considered: Cepheids, RR 
Lyrae and Eclipsing Binaries. The period, average brightness and color of the light curves were 
used as features for the classifier. In addition, the authors included the phase diagram of the 
light curve as a feature and proposed a kernel to compare time series, which is plugged into 
the SVM. The phase diagram is obtained using the underlying period of the light curves and 
the epoch folding transformation IfTSl . The light curve processing pipeline is shown in Fig. 
4b. The proposed kernel takes care of the possible difference in phase between time series. 
Using the shape of the light curve and the proposed kernel, correct classification rates close 
to 99% were obtained. Intuitively, the shape of the periodicity (phase diagram) should be a 
strong feature for periodic variable star classification. The authors note that a complete pipeline 
would require first discriminating whether the light curve is periodic, and estimating its period 
with high accuracy. Wrongly estimated periods would alter the folded light curve, affecting the 
classification performance. 

A Random Forest (RF) classifier for periodic variable stars from the Hipparcos survey was 
presented in [IBTI . A set of 2,000 reliable variable sources found in the literature was used to train 
the classifier to discriminate 26 types of periodic variable stars. Non-periodic variables were also 
added to the training set. Light curves were characterized using statistical moments, periods and 
Fourier coefficients. The performance of the classifier is consistent with other studies [28]. The 
authors found that the most relevant features in decreasing order are the period, amplitude, color 
and the Fourier coefficients (light curve model). The authors also found that the major sources of 
misclassification are related to the reliability of the estimated periods and the misidentification 
of non-periodic light curves as periodic. 

Statistical classifiers work under the assumption that class probabilities are equivalent for the 
training and testing sets. According to [|^ this assumption may not hold when the number 
of light curve measurements is different between sets. In that work this problem is addressed via 
noisification and denoisification, i.e., trying to modify the pdf of the training set so that it 
mimics the test set, and to infer the class of a poorly sampled time series according to its most 
probable evolution. This scheme is tested on light curves from the OGLE survey. Results show 
that noisification and denoisification improve the classification accuracy for poorly sampled time 
series by 20%. The authors note that the proposed method may help overcome other systematic 
differences between sets such as varying cadences and noise properties. 
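
The details of the noisification procedure are not reproduced in this review. As a loose illustration of the underlying idea, making a well-sampled training light curve resemble a sparsely sampled test light curve, the sketch below keeps a random subset of observations and optionally adds Gaussian jitter. The function name, the Gaussian noise model and all parameters are assumptions made for illustration and should not be read as the cited authors' method.

```python
import random


def noisify(times, mags, n_target, extra_sigma=0.0, seed=None):
    """Degrade a light curve to roughly match a sparser, noisier test set.

    Keeps `n_target` randomly chosen observations (in time order) and adds
    zero-mean Gaussian noise of standard deviation `extra_sigma` to each
    retained magnitude. Both choices are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_target = min(n_target, len(times))
    keep = sorted(rng.sample(range(len(times)), n_target))
    new_times = [times[k] for k in keep]
    new_mags = [mags[k] + rng.gauss(0.0, extra_sigma) for k in keep]
    return new_times, new_mags
```

Features would then be recomputed on the degraded training curves so that the classifier sees training and test examples with comparable sampling.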

Classification of non-periodic variable sources is less developed than periodic source classification. 
Non-periodic sources in general are more difficult to characterize, which is why 
only a few studies do general Active Galactic Nuclei (AGN) classification, instead focusing on 
discriminating a particular type of quasi-stellar object (QSO). In [1^ a supervised classification 
scheme based on SVM was used to discriminate AGN from their light curves. The objects were 
characterized using 11 features including amplitude, color, autocorrelation function, variability 
index and period. A training set of ~5,000 objects including non-variable and variable stars 
and 58 known quasars from the MACHO survey was used to train the classifier. This work li3^ 
differs from previous attempts at AGN discrimination in its thoughtful study of the efficiency 
and false positive rates of the model and classifier. The classifier was tested on the full 40 million 
MACHO light curves, finding 1,620 QSO candidates that were later cross-matched with external 
quasar catalogs, confirming the validity of the candidates. 

In [34], a multi-class classifier for variable stars was proposed and tested. This implementation 
shares the features, data, and the 25-class taxonomy used in [28], allowing direct comparison. 
An important distinction with respect to [28] is that the classification is extended to include 
non-periodic eruptive variables¹. Two classification schemes are tested: In the first, pairwise 
comparisons between two-class classifiers are used, and in the second, a hierarchical classification 
scheme based on the known taxonomy of the variable stars is employed (Fig. [^). The 
best performance is obtained by the RF with pairwise class comparisons, achieving a correct 
classification rate of 73.3% when using the same features as [28], and 77.2% when only the 
more relevant features are used. In a taxonomical sense, a mistake committed in the first tier 
of the classification hierarchy (catastrophic error) is more severe than a mistake in the final 
tiers (sub-type classifiers). The hierarchical RF implementation obtains a slightly worse overall 
performance (1%) and a smaller catastrophic error (8%) than the pairwise RF. Although there 
is a considerable improvement with respect to [28], the accuracy is not high enough for fully 
automated classification. The taxonomical classification of variable stars and their multiple sub-types 
is still an open problem. 

Using information from several catalogs may improve the characterization of the objects under 
study, but joining catalogs is not a trivial task as different surveys use different instruments 
that may be installed in totally different locations. Intersecting the catalogs, i.e., removing 
columns/rows with missing data, may result in a database that is smaller than the original single 
catalogs. A variable star classifier for incomplete catalogs was proposed in [35]. In this work 
the structure of a Bayesian network is learned from a joined catalog with missing data from 
several surveys. The joined catalog is “filled” by estimating the probability distributions and 
dependencies between the features through the Bayesian network. The resulting training set 
has 1,833 samples including non-variables, non-periodic variables (quasars and Be stars), and 
periodic variable stars (Cepheids, RR Lyrae, EB and Long Period Variables). An RF classifier 
is trained with the joined catalog. The Bayesian network is compared to a second approach 
for filling missing data based on GMM, obtaining better classification accuracy for most of 
the selected classes. An additional test on 20 million light curves from the MACHO survey (a 
catalog with no missing data) was performed. From this test a set of 1,730 quasar candidates 
was obtained, which corresponds to a 15% improvement with respect to previous quasar lists in 
the literature [l3^ . 

¹ Simple statistical moments from the flux distribution are used as features for the eruptive variables. 

C. Unsupervised and Semi-supervised Learning in Time-domain Astronomy 

In some cases previous information on the phenomena might be insufficient or non-existent, 
hence a training set cannot be constructed. In these cases one may need to go back one step and 
obtain this information from the data using unsupervised methods. In a broad sense the objective 
of unsupervised methods is to estimate the density function that generated the data, revealing 
its structure. One of the first references for unsupervised learning in astronomy is found in 
0^ , where self-organizing maps (SOMs) were used to discriminate clusters of stars in the solar 
neighborhood. One hundred thousand stars from the Hipparcos catalog, mixed with synthetic 
data, were used. The synthetic stars were modeled with particular characteristics of known 
stellar populations. The Hipparcos catalog provides information about the position, magnitude 
(brightness), color², spectral type and variability among other features. The SOM was trained 
on a 10x10 grid with the additional constraint that each node should have at least one synthetic 
star. Using the synthetic stars as markers, clusters of stellar populations were recognized in the 
visualization of the SOM. In this case the SOM was used not only to find population clusters 
but also to validate the theoretical models used to create the synthetic stars. 

SOM was also used in [37] in order to learn clusters of mono-periodic variable stars. The 
main objective was to classify periodic variable stars in the absence of a training set, which was 
the reason the SOM was selected. The feature in this case was an N-dimensional vector obtained 
from the folded light curves. Each light curve was folded with its period, which was known a 
priori. The folded light curves were then normalized in scale and discretized in N bins. The 
number of bins and SOM topology parameters were calibrated using five thousand synthetic light 
curves representing four classes of periodic variables. The SOM was then tested on a relatively 
small set of 1,206 light curves. The clusters for each class were discriminated using a U-matrix 
visualization and density estimations. Clusters associated with Eclipsing Binaries, Cepheids, RR 
Lyrae and δ Scuti populations were identified. Although restricted in size, this application shows 
the potential of the SOM for class discovery in time-domain astronomy. 

² The color corresponds to the difference in average brightness between two different spectra. 

In [38] a density-based approach for clustering was used to find groups of variable stars in the 
OGLE and CoRoT surveys. The light curves are characterized using the features described in 
||2^ . Each point in feature space is assigned to one of the clusters using Modal Expectation Maximization. 
This work addresses the need of tying up astronomical knowledge with the outcome 
of the computational intelligence algorithms. The manner in which astronomers have classified 
stellar objects and events does not necessarily correspond to that produced by automated systems. 
Interestingly, this study establishes that there is another problem as well: the same computational 
intelligence algorithms working on different databases produced distinct classification structures, 
showing that even though these databases have large numbers of examples, they have inherent 
biases and may not be sufficiently large to allow the discovery of general rules. This problem has 
also been reported in other fields, specifically artificial vision [|^ . The work in [|^ showed that 
in order to produce consistent classification performances, one could not simply use databases 
with hundreds of thousands of examples; it was necessary to use close to 80 million images, far 
exceeding what was traditionally considered enough by the practitioners of the field. 

Kernel Principal Component Analysis (KPCA) was used in [|^ to perform spectral clustering 
on light curves from the CoRoT survey. The light curves were characterized using three different 
approaches: Fourier series, autocorrelation functions, and Hidden Markov Models (HMMs). 
Then, dimensionality was reduced with KPCA using the Gaussian kernel. Finally, the eigenvalues 
were used to find clusters of variable stars. This novel characterization of light curves permits 
identifying not only periodic variable stars correctly (Fourier and autocorrelation features), but 
also irregular variable stars (HMM features). 

Unsupervised learning can also be used for novelty detection, i.e., finding objects that are 
statistically different from everything that is known and hence cannot be classified in one of the 
existing categories. Astronomy has a long history regarding serendipitous discovery llTO , i.e., 
to find the unexpected (and unsought). Computational intelligence and machine learning may 
provide the means for facilitating the task of novelty detection. 

One may argue that the first step for novelty detection is to define a similarity metric for 
astronomical time series in order to compare time-varying astronomical objects. This is the 
approach found in [|4^ where a methodology for outlier light curve identification in astronomical 
catalogs was presented. A similarity metric based on the correlation coefficient is computed 
between every pair of light curves in order to obtain a similarity matrix. Intuitively, the outlier 
variable star will be dissimilar to all the other variables. Before any distance calculation, light 
curves are interpolated and smoothed in order to normalize the number of points and time 
instants. For each pair the lag of maximum cross-correlation is found in Fourier space, which 
solves the problem of comparison between light curves with arbitrary phases. Finally, the outliers 
correspond to the light curves with the lowest cross-correlations with respect to each row of 
the distance matrix. The method was tested on ~34,500 light curves from early-stage periodic 
variable star catalogs originating from the MACHO and OGLE O surveys. The results of this 
process were lists of mislabeled variables with careful explanations of the new phenomena 
and reasons why they were misclassified. Calculating the similarity matrix scales quadratically 
with the number of light curves in the survey. The authors discuss this issue and provide an 
approximation of the metric that reduces the computational complexity to O(N). Also, an exact 
and efficient solution for distance-based outlier detection can be found in [43], which uses a 
discord metric that requires only two linear searches to find outlier light curves as well. 
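
The pairwise comparison described above can be illustrated with a minimal Python sketch. It is 
not the implementation of the cited work: it assumes that all light curves are already 
phase-folded, interpolated onto the same number of points and non-constant, it searches the 
best lag by brute force instead of in Fourier space, and it scores each curve by its mean 
similarity to all others. The function names and the toy data are hypothetical. 

```python
import math

def standardize(values):
    """Shift to zero mean and scale to unit variance (assumes a non-constant curve)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def max_circular_correlation(a, b):
    """Largest correlation coefficient between a and any circular shift of b."""
    za, zb = standardize(a), standardize(b)
    n = len(za)
    return max(
        sum(za[i] * zb[(i + lag) % n] for i in range(n)) / n
        for lag in range(n)
    )

def rank_outliers(curves):
    """Return curve indices sorted from least to most similar to the rest."""
    n = len(curves)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = max_circular_correlation(curves[i], curves[j])
    # Mean similarity to all other curves; the diagonal entry (1.0) is excluded.
    score = [(sum(row) - 1.0) / (n - 1) for row in sim]
    return sorted(range(n), key=lambda i: score[i])

# Toy data (assumed): four sinusoids with arbitrary phases and one eclipse-like pulse.
N_POINTS = 32

def sinusoid(phase):
    return [math.sin(2 * math.pi * t / N_POINTS + phase) for t in range(N_POINTS)]

pulse = [1.0 if t < 2 else 0.0 for t in range(N_POINTS)]
curves = [sinusoid(0.0), sinusoid(1.0), sinusoid(2.5), sinusoid(4.0), pulse]
ranking = rank_outliers(curves)
print(ranking[0])  # index of the most outlying curve
```

The double loop over curve pairs makes the quadratic cost mentioned in the text explicit, and 
the inner search over lags is the step that a Fourier-space computation would accelerate. 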

A different approach for novelty detection was given in [44], where an anomaly detection 
technique dubbed PCAD (Periodic Curve Anomaly Detection) was proposed and used to find 
outlier periodic variables in large astronomical databases. PCAD finds clusters using a modified 
k-means algorithm called phased k-means (pk-means). This modification is required in order 
to compare asynchronous time series (arbitrary phases). By using a clustering methodology the 
authors were able to find anomalies in both a global and local sense. Local anomalies correspond 
to periodic variables that lie on the frontier of a given class. Global anomalies, on the other hand, 
differ from all the clusters. Approximately 10,000 periodic light curves from the OGLE survey 
were tested with PCAD. The pre-processing of the light curves, the selection of features, and 
the computation of the cross-correlations follow the work of [42]. The cross-correlation is used 
as a distance metric for the pk-means. The results obtained by the method were then evaluated 
by experts and sorted as noisy light curves, misclassified light curves and interesting outliers 
worthy of follow-up. 
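
To make the idea of clustering asynchronous time series concrete, the following Python sketch 
shows a simplified phase-aware k-means. It should be read as an illustration under stated 
assumptions rather than as the PCAD algorithm itself: every curve is circularly shifted to its 
best alignment with each centroid before distances are computed, and centroids are recomputed 
from the aligned members. Equal-length curves, the initialization from two hand-picked curves 
and all names are assumptions made here. 

```python
import math

def shift(series, lag):
    """Circularly shift a series by `lag` samples."""
    n = len(series)
    return [series[(i + lag) % n] for i in range(n)]

def best_shift(series, centroid):
    """Lag that maximizes the dot product (equivalently, minimizes squared distance)."""
    n = len(series)
    best_lag, best_dot = 0, -math.inf
    for lag in range(n):
        dot = sum(series[(i + lag) % n] * centroid[i] for i in range(n))
        if dot > best_dot:
            best_lag, best_dot = lag, dot
    return best_lag

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def phased_kmeans(curves, init_centroids, iterations=10):
    """k-means in which each curve is phase-aligned to a centroid before comparison."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(curves)
    for _ in range(iterations):
        aligned_by_cluster = [[] for _ in centroids]
        for idx, curve in enumerate(curves):
            candidates = []
            for k, c in enumerate(centroids):
                aligned = shift(curve, best_shift(curve, c))
                candidates.append((sq_dist(aligned, c), k, aligned))
            _, k, aligned = min(candidates, key=lambda item: item[0])
            labels[idx] = k
            aligned_by_cluster[k].append(aligned)
        for k, members in enumerate(aligned_by_cluster):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data (assumed): one-cycle and two-cycle waves observed at arbitrary phases.
N_POINTS = 24

def wave(cycles, lag):
    return [math.sin(2 * math.pi * cycles * (t + lag) / N_POINTS) for t in range(N_POINTS)]

curves = [wave(1, 0), wave(1, 5), wave(1, 11), wave(2, 0), wave(2, 3), wave(2, 7)]
labels, centroids = phased_kmeans(curves, init_centroids=[curves[0], curves[3]])
print(labels)
```

In such a scheme, curves that remain far from every centroid after alignment would be 
candidates for global anomalies, while curves near the edge of their own cluster would be 
candidates for local ones. 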

A problem with purely unsupervised methods is that prior knowledge, when available, is not 
necessarily used. Semi-supervised learning schemes deal with the case where labels (supervised 
information) exist although not for all the available data. Semi-supervised methods are able to find 
the structure of the data distribution, learn representations and then combine this information 
with what is known. Semi-supervised methods can also be used for novelty detection, with 
the benefit that they may improve their discrimination by automatically incorporating the newly 
extracted knowledge. The semi-supervised approach is particularly interesting in the astronomical 
case where prior information exists, although scarce in comparison to the bulk of available 
unlabeled data. In P31 a semi-supervised scheme for classification of supernova sub-types was 
proposed. In this work the unlabeled supernovae data are used to obtain optimal low-dimensional 
representations in an unsupervised way. A diagram of the proposed implementation is shown in 
Fig. 5a. In general, features are extracted from supernovae light curves following fixed templates. 
The data-driven feature extraction proposed in [|^ performs better and is more efficient than 
template methods with respect to data utilization and scaling. 

A bias due to sample selection occurs when training and test datasets are not drawn from the 
same distribution. In astronomical applications, training datasets often come from older catalogs 
which, because of technological constraints, contain more information on brighter and closer 
astronomical objects, i.e., the training dataset is a biased representation of the whole. In these 
cases standard cross-validation procedures are also biased, resulting in poor model selection and 
sub-optimal prediction. The selection bias problem is addressed from an astronomical perspective 
in [46] through the use of active learning (AL). A diagram of the proposed implementation is 
shown in Fig. 5b. In AL, the method queries the expert for manual follow-up of objects that 
cannot be labeled automatically. There is a natural synergy between AL and astronomy, because 
the astronomer is, in general, able to follow up a certain target in order to obtain additional 
information. The AL classifier consistently performed better than traditional implementations. 
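
The query loop at the heart of AL can be sketched as follows. This is a toy illustration, not 
the classifier or query criterion of the cited work: it assumes a single scalar feature, uses a 
nearest-centroid classifier as a stand-in for the supervised model, selects the pool object 
with the smallest margin between the two closest classes, and replaces the human expert with 
a hypothetical `oracle` function. 

```python
def class_centroids(labeled):
    """Mean feature value per class from (feature, label) pairs."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(x, cents):
    """Label of the nearest class centroid."""
    return min(cents, key=lambda y: abs(x - cents[y]))

def margin(x, cents):
    """Gap between the two smallest centroid distances (small means uncertain)."""
    d = sorted(abs(x - m) for m in cents.values())
    return d[1] - d[0]

def active_learning(labeled, pool, oracle, budget):
    """Repeatedly ask the oracle to label the most uncertain pool object."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(budget):
        if not pool:
            break
        cents = class_centroids(labeled)
        query = min(pool, key=lambda x: margin(x, cents))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # expert follow-up
    return class_centroids(labeled), labeled

# Toy setup (assumed): the training set only covers the extremes of the feature
# range, mimicking a catalog biased towards easy objects.
training = [(0.0, "A"), (1.0, "A"), (9.0, "B"), (10.0, "B")]
pool = [0.5, 4.0, 5.5, 9.5, 3.0]

def oracle(x):
    return "A" if x < 5.0 else "B"

final_centroids, labeled = active_learning(training, pool, oracle, budget=2)
queried = [x for x, _ in labeled[len(training):]]
print(queried)
```

With this setup the two follow-up requests go to the objects lying between the two classes, 
which are exactly the region that the biased training set does not cover. 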


Fig. 5. (a) Semi-supervised learning scheme used in m for supernovae classification. (b) An active learning approach to 
building training sets for supervised classification of variable stars. Samples from the testing set are moved (unsupervised) to the 
training sets, reducing sample selection bias. The expert is queried by the method if more information is needed for obtaining 
the new labels. 


IV. Future Big Data Challenges in Time Domain Astronomy 

In this section we describe the future big data challenges in TDA from the viewpoint of 
computational intelligence and machine learning using as an example the LSST. The astronomical 
research problems targeted by the LSST are described in the LSST Science Book @. The LSST 
focuses on time-domain astronomy, and will be able to provide a movie of the entire sky in the 
southern hemisphere for the first time. Some of the LSST challenges are: detecting faint transient 
signals such as supernovae with a low false-positive rate, classifying transients, estimating the 
distance from Earth, and making discoveries and classification in real-time iTT^. Facilities such 
as the LSST will produce a paradigm change in astronomy. On the one hand the new telescope 
will be entirely dedicated to a large-scale survey of the sky, and individual astronomers will not 
be allowed to make private observations as they used to do in the past iTT^. On the other hand, 
the data volume and its flow rate will be so large that most of the processing will have to be done 
automatically using robotic telescopes and automated data analysis. Data volumes from large 
sky surveys will grow from Terabytes during this decade (e.g., PanSTARRS [5]) to hundreds 
of Petabytes during the next decade (LSST [lEl, BH). The final LSST image archive will be ~150 
Petabytes and the astronomical object catalog (object-attribute database) is expected to be 
~40 Petabytes, comprising 200 attributes for 50 billion objects lITOll. In lim the following three 
challenges are identified for the LSST: a) mining a massive data stream of ~2 Terabytes per 
hour in real time for 10 years, b) classifying more than 50 billion objects and following up many 
of these events in real time, c) extracting knowledge in real time for ~2 million events per 
night. The analysis of astronomical data involves many stages, and in all of them it is possible 
to use CI techniques to help its automation. In this paper we focused on the analysis of the light 
curves, but there are other CI challenges in image acquisition & processing IH, dimensionality 
reduction and feature selection [47], [48], etc. There are also major technical challenges related 
to the astronomical instruments, storage, networking and processing facilities. The challenges 
associated with data management and the proposed engineering solutions are described in |j8l. 
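
A short back-of-the-envelope calculation helps to put these figures in perspective. The Python 
snippet below only rearranges the numbers quoted above; the decimal definition of Terabyte and 
Petabyte and the ten-hour observing night are assumptions introduced here for illustration. 

```python
TERABYTE = 10**12   # decimal units (assumption)
PETABYTE = 10**15

# Sustained stream rate implied by ~2 Terabytes per hour.
stream_mb_per_s = 2 * TERABYTE / 3600 / 10**6

# Average alert rate implied by ~2 million events per night,
# assuming a ten-hour observing night (assumption).
NIGHT_HOURS = 10
events_per_s = 2_000_000 / (NIGHT_HOURS * 3600)

# Average catalog volume per object: ~40 Petabytes for 50 billion objects.
bytes_per_object = 40 * PETABYTE / (50 * 10**9)

print(f"stream rate      ~ {stream_mb_per_s:.0f} MB/s")
print(f"alert rate       ~ {events_per_s:.1f} events/s")
print(f"catalog per object ~ {bytes_per_object / 1000:.0f} kB")
```

Under these assumptions any real-time classifier would have, on average, a budget of a few 
tens of milliseconds per event if alerts were processed sequentially on a single stream. 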

The LSST will be dedicated exclusively to the survey program, thus follow-up observations 
(light curves, spectroscopy, multiple wavelengths), which are scientifically essential, must be 
done by other facilities around the world [fTOl. With this goal the LSST will generate millions of 
event alerts during each night for 10 years. Many of the observed phenomena are transient events 
such as supernovae, gamma-ray bursts, gravitational microlensing events, planetary occultations, 
stellar flares, accretion flares from supermassive black holes, asteroids, etc. [49]. A key challenge 
is that the data need to be processed as they stream from the telescopes, comparing them with the 
previous images of the same parts of the sky, automatically detecting any changes, and classifying 
and prioritizing the detected events for rapid follow-up observations [50]. The system should 
output a probability of any given event as belonging to any of the possible known classes, or as 
being unknown. An important requirement is maintaining a high level of completeness (do not miss 
any interesting events) with a low false alarm rate, and the capacity to learn from past experience 
[49]. The classification must be updated dynamically as more data come in from the telescope 
and the feedback arrives from the follow-up facilities. Another problem is determining what 
follow-up observations are the most useful for improving classification accuracy, and detecting 
objects of scientific interest. In [ISTll maximizing the conditional mutual information is proposed. 
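
One simple way to picture a classifier that outputs class probabilities, includes an "unknown" 
option and is updated as data arrive is a sequential Bayesian update. The sketch below is only 
an illustration of that idea and is not taken from the cited works: it assumes a single scalar 
feature per observation, Gaussian likelihoods for the known classes, a flat likelihood for the 
unknown class, and invented class names and parameters. 

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def update(posterior, observation, models, unknown_density):
    """One Bayes step: multiply each class probability by its likelihood, renormalize."""
    unnormalized = {}
    for name, prob in posterior.items():
        if name == "unknown":
            likelihood = unknown_density  # flat density over the feature range
        else:
            mean, std = models[name]
            likelihood = gaussian_pdf(observation, mean, std)
        unnormalized[name] = prob * likelihood
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

def classify_stream(prior, observations, models, unknown_density):
    """Return the posterior after each new observation of the same event."""
    posterior = dict(prior)
    history = []
    for obs in observations:
        posterior = update(posterior, obs, models, unknown_density)
        history.append(posterior)
    return history

# Hypothetical class models: (mean, standard deviation) of the feature.
models = {"supernova": (2.0, 0.5), "variable_star": (0.0, 0.5)}
prior = {"supernova": 0.45, "variable_star": 0.45, "unknown": 0.10}
UNKNOWN_DENSITY = 0.05  # uniform over an assumed feature range of width 20

known = classify_stream(prior, [1.8, 2.2, 1.9], models, UNKNOWN_DENSITY)
odd = classify_stream(prior, [6.0, 6.5], models, UNKNOWN_DENSITY)
print(known[-1])
print(odd[-1])
```

In this toy setting the confidence in the correct class grows with each observation, while an 
event that fits none of the known models is progressively pushed towards the unknown class. 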

Tackling the future challenges in astronomy will require the cooperation of scientists working 
in the fields of astronomy, statistics, informatics and machine learning/computational intelligence 
d. In fact the fields of astroinformatics [fTSll and astrostatistics have been recently created to 
deal with the challenges mentioned above. Astroinformatics is the new data-oriented paradigm 
for astronomy research and education, which includes data organization, data description, 
taxonomies, data mining, knowledge discovery, machine learning, visualization and statistics flTOl. 

The characterization (unsupervised learning) and classification (supervised learning) of massive 
datasets are identified as major research challenges [fTOl . For time-domain astronomy the rapid 
detection, characterization and analysis of interesting phenomena and emergent behavior in 
high-rate data streams are critical aspects of the science [fTOl . Unsupervised learning and semi- 
supervised learning are believed to play a key role in new discoveries. To deal with big data in 
TDA in the Peta-scale era the following open problems need to be solved: 

1) Developing very efficient algorithms for large-scale astroinformatics/astrostatistics. Fast 
algorithms for commonly used operations in astronomy are described in [52]: e.g., all nearest 
neighbors, n-point correlation, Euclidean minimum spanning tree, kernel density estimation 
(KDE), kernel regression, and kernel discriminant analysis. N-point correlations are used to 
compare the spatial structure of two data sets, e.g., luminous red galaxies in the Sloan Digital 
Sky Survey [53]. KDE is used for comparing the distributions of different kinds of objects. Most 
of these algorithms involve distance comparisons between all data pairs, and therefore are naively 
O(N^2) or of even higher complexity. With the goal of achieving linear or O(N log N) runtimes 
for pair-distance problems, space-partitioning tree data structures such as kd-trees are used, in 
a divide and conquer approach. In the KDE problem, series expansions for sums of kernel 
functions are truncated to approximate continuous functions of distance. In [54] it is argued 
that the algorithms should be efficient in three respects: computational (number of computations 
done), statistical (number of samples required for good generalization), and human involvement 
(amount of human labor to tailor the algorithm to a task). The authors state that there are 
fundamental limitations for certain classes of learning algorithms, e.g., kernel methods. These 
limitations come from their shallow structure (single layered), which can be very inefficient 
in representing certain types of functions, and from using local estimators, which suffer the 
curse of dimensionality. In contrast, deep architectures, which are compositions of many layers 
of adaptive nonlinear components, e.g., multilayer neural networks with several hidden layers, 
have the potential to generalize in nonlocal ways. In [55] a layer-by-layer unsupervised learning 
algorithm for deep structures was proposed, opening a new line of research that is still on-going. 

2) Developing effective statistical tools for dealing with big data. The large sample size and 
high dimensionality characteristics of big data raise three statistical challenges [56]: i) noise 
accumulation, spurious correlations, and incidental endogeneity (residual noise is correlated with 
the predictors), ii) heavy computational cost and algorithmic instability, iii) heterogeneity, 
statistical biases. Dimension reduction and variable selection are key for analyzing high dimensional 
data [[5^, [47], [48]. Noise accumulation can be reduced by using the sparsity assumption. 

3) Creating implementations targeted for High-Performance Computing (HPC) architectures. 
Traditional analysis methods used in astronomy do not scale to peta-scale volumes of 
data on a single computer [57]. One could rework the algorithms to improve computational 
efficiency but even this might prove to be insufficient with the new surveys. An alternative is to 
decompose the problem into several independent sub-problems. Computations can then proceed 


September 28, 2015 


DRAFT 





in parallel over a shared memory cluster, a distributed memory cluster, or a combination of 
both. In a shared memory cluster the processes launched by the user can communicate and share 
data and results through memory. In a distributed environment each processor receives data and 
instructions, performs the computations and reports the results back to the main server. The 
number of processors per node, amount of shared memory and network speed have to be taken 
into account when implementing an algorithm for HPC architectures. Efficiency will ultimately 
depend on how separable the problem is in the first place. Another parallel computing strategy 
involves the use of GPUs (graphical processing units) instead of or side by side with conventional 
CPUs (central processing units). GPGPU (general purpose computing in GPU) is a relatively new 
paradigm for highly parallel applications in which high-complexity calculations are offloaded to 
the GPU (coprocessor). GPUs are inherently parallel, harnessing up to 2,500 processing cores 
(NVIDIA Tesla K20 module). The processing power and relatively low cost of GPUs have made 
them popular in the HPC community and their availability has been on the rise [58]. Note that 
explicit thread and data parallelism must be exploited in order to get the theoretical speed-ups 
of GPUs over CPUs. Dedicated hardware based on FPGAs may provide interesting speed-ups for 
TDA algorithms [59]. However, due to the advanced technical knowledge required to use them, 
FPGAs are not as popular in astronomy as the HPC resources already presented. Interdisciplinary 
collaborations between electrical & computer engineers and astronomers might change this in the 
near future. An existing non-parallel algorithm can be extended using the MapReduce [60] model 
for distributed computing, a model inspired by functional programming. Programs written in this 
functional style are automatically parallelized. In [1^ the MapReduce model was used to develop 
distributed and massively parallel implementations of k-means, support vector machines, neural 
networks and Naive Bayes, among others. In the cloud computing paradigm the hardware resources 
(processors, memory and storage) are almost entirely abstracted and can be increased/decreased 
by the user on demand. Distributed models such as MapReduce have a high synergy with the cloud 
computing paradigm. Cloud computing services such as Amazon EC2 provide cost-effective 
HPC solutions for compute and/or memory bound scientific applications as shown in [|6^. As 
pointed out in [63], one of the biggest advantages of implementing astronomical pipelines using 
cloud services is that the computing resources can be scaled rather easily and according to 
changing workloads. The granular computing paradigm [64] is also of interest for astronomical 
big data applications. In granular computing the information extracted from data is modeled 
as a hierarchical structure across different levels of detail or scales. This can help to compress 
the information and reduce the dimensionality of the problem. Another approach is the virtual 
observatory (VO), a cyberinfrastructure for discovery and access to all distributed astronomical 
databases m- A useful data portal for data mining is OpenSkyQuery ll^, which allows users 
to do multi-database queries on many astronomical object catalogs. 

4) Developing fast algorithms for online event detection and discrimination. Several 
facilities around the world will follow the 2 million events that the LSST will issue each night. 
These facilities will need to decide which events are most relevant so as not to waste their 
limited observing and storage resources. In addition, these decisions have to be made as fast as 
possible to avoid missing important data. Pattern recognition methods to quickly analyze and 
discriminate interesting phenomena from the streamed data are needed. These methods should 
update their results online and return an associated statistical confidence that increases as more 
data is retrieved from the LSST. It is critical not to miss any relevant event while keeping the 
contamination from false positives as low as possible. Additionally, these methods should learn 
from past experience and adapt depending on the previously selected events. Designing methods 
that comply with these requirements is currently an open problem. 
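
As an illustration of the space-partitioning idea mentioned in open problem 1), the following 
Python sketch builds a kd-tree and answers nearest-neighbor queries by pruning branches that 
cannot contain a closer point. It is a textbook construction written for clarity, not the 
optimized algorithms of the cited references; the two-dimensional random points are toy data 
and all names are assumptions. 

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build(points, depth=0):
    """Recursively split the points at the median of alternating coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or sq_dist(point, target) < sq_dist(best, target):
        best = point
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, best)
    return best

# Toy data (assumed): 200 random positions in the unit square.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(points)
print(nearest(tree, (0.5, 0.5)))
```

Building the tree costs O(N log N) with this median split, and a query typically inspects only 
a small fraction of the points, which is what makes all-nearest-neighbor searches tractable 
compared with the naive O(N^2) comparison of every pair. 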
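
The MapReduce formulation mentioned in open problem 3) can be illustrated with one iteration 
of k-means written in a map/shuffle/reduce style. The sketch below runs in a single Python 
process and only mimics the programming model; it does not use any distributed framework, and 
the function names and toy points are assumptions. The key property is that the reducer is 
associative, so partial sums could be computed independently on different nodes. 

```python
from collections import defaultdict
from functools import reduce

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_phase(point, centroids):
    """Emit (index of the nearest centroid, (coordinate sums, count))."""
    key = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return key, (point, 1)

def combine(a, b):
    """Associative reducer: add coordinate sums and member counts."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b

def kmeans_step(points, centroids):
    """One k-means iteration expressed as map, shuffle and reduce."""
    groups = defaultdict(list)
    for key, value in (map_phase(p, centroids) for p in points):  # map + shuffle
        groups[key].append(value)
    new_centroids = list(centroids)  # centroids without members are kept as they are
    for key, values in groups.items():                            # reduce
        total, count = reduce(combine, values)
        new_centroids[key] = tuple(s / count for s in total)
    return new_centroids

# Toy data (assumed): two well-separated groups of points.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = [(1, 1), (9, 9)]
updated = kmeans_step(points, centroids)
print(updated)
```

Because each call to `map_phase` depends only on one point and the current centroids, the map 
phase can be split across workers without communication, and only the small per-cluster sums 
have to be exchanged in the reduce phase. 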

V. Concluding Remarks 

In a few years the LSST will be fully operational capturing the light of billions of astronomical 
objects, and generating approximately two million events each night for ten years. The LSST 
team itself, and multiple external facilities around the world, will follow and study these events. 
The main objectives are to characterize and classify the transient phenomena arising from the 
moving sky. Additionally, it is expected that a plethora of scientific discoveries will be made. If 
the right tools are used, science would be produced at rates without precedent. 

Conventional astronomy is not prepared for this deluge of observational data and hence a 
paradigm shift in TDA has been observed. Astronomy, statistics and machine learning have 
been combined in order to produce science that can provide automated methods to deal with the 
soon to come synoptic surveys. Computational intelligence methods for pattern recognition are 
essential for the proper exploitation of synoptic surveys, being able to detect and characterize 
events that otherwise might not even be noticed by human investigators. 

In this review we have studied several machine learning based implementations proposed 
to solve current astronomical problems. The particular challenges faced when applying machine 
learning methods in TDA include 1) the design of representative training sets, 2) the combination 
and reuse of training databases for new surveys, 3) the definition of feature vectors from domain 
knowledge, 4) the design of fast and scalable computational implementations of the methods in 
order to process the TDA databases within feasible times, and finally, 5) the sometimes difficult 
interpretation of the results obtained and the question of how to gain physical insight from them. 

The quality of a training set is critical for the correct performance of supervised methods. 
In astronomy an intrinsic sample selection bias occurs when knowledge gathered from previous 
surveys is used with new data. Semi-supervised learning and active learning rise as feasible 
options to cope with large and heterogeneous astronomical data, providing particular solutions 
to the dilemmas regarding training sets. It is very likely that we will see more semi-supervised 
applications for astronomy in the near future. The reuse of training sets is critical in terms of 
scalability and validity of results. The integration with existing databases and the incorporation 
of data observed at different wavelengths are currently open issues. Feature spaces that are 
survey-independent may provide an indirect solution to the combination of training sets and the 
applications of trained classifiers across different surveys. 

Although powerful, the sometimes extended calibration required by machine learning methods 
can be difficult for inexperienced users. The selection of the algorithms, the complexity of the 
implementations, the exploration of parameter space, and the interpretation of the outputs in 
physical terms are some of the issues one has to face when using machine learning methods. 
The learning curve might be too steep for an astronomer to take the initiative, but all the issues 
named here can be solved by inter-disciplinary collaboration. Teams assembled from the fields of 
astronomy, statistics, computer science and engineering have everything that is needed to propose 
solutions for data-intensive TDA. The deluge of astronomical data opens up huge opportunities 
for professionals with knowledge in computational intelligence and machine learning. 